Accurately testing for marijuana usage is very important in different everyday scenarios. For example, THC in marijuana can affect an individuals motor skills, depth perception, and overall cognition. This then can hinder their ability to work effectively and safely. According to the National Institute on Drug Abuse, it is noted that employees that tested positive for marijuana had an increase of 55% when it came to workplace accidents and that they were responsible for another increase of 85% of work-related injuries 1. These liabilities can hurt the company and more importantly, the individual; hence why it is paramount for companies to run effective drug tests on their employees. Another example why finding out which compound and matrix is most effective to use when conducting a drug test is for scenarios in which we’d like to find out if a driver is under the influence or not. Quoted directly from the National Highway Traffic Safety Administration, “In the 2013-2014 survey 2, 12.6 percent of weekend nighttime drivers tested positive for marijuana. That’s a 48-percent increase in less than 10 years” 3.
For our case study, the primary focus is to figure out which compound from which matrix (whole blood, breath, oral fluid) is the best bio-marker for recent use, with “recent use” definied to be within the last 3 hours.
library(tidymodels)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(janitor)
library(purrr)
library(rstatix)
library(cowplot)
The data we will be using comes from a placebo-controlled, double-blinded, randomized study published by Hoffman et. al. (2021). In the study, volunteers were randomly assigned to smoke a ciggertte containing placebo (0.02%), or 5.9% (low dose) or 13.4% (high dose) THC. Samples of whole blood (WB), oral fluid(OF), breath (BR) are taken before and after smoking.
WB = read.csv("data/Blood.csv")
BR = read.csv("data/Breath.csv")
OF = read.csv("data/OF.csv")
We re-coded and re-leveled variables (Treatment and
Group), cleans column names, and renames specific columns
(x11_oh_thc to thcoh, thc_v to
thcv, thccooh_gluc to
thc_cooh_gluc, and thccooh to
thc_cooh) in all 3 tables. Using janitor
package to organized column names.
OF <- OF |>
mutate(Treatment = fct_recode(Treatment,
"5.9% THC (low dose)" = "5.90%",
"13.4% THC (high dose)" = "13.40%"),
Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)"),
Group = fct_recode(Group,
"Occasional user" = "Not experienced user",
"Frequent user" = "Experienced user" )) |>
janitor::clean_names() |>
rename(thcoh = x11_oh_thc,
thcv = thc_v)
WB <- WB |>
mutate(Treatment = fct_recode(Treatment,
"5.9% THC (low dose)" = "5.90%",
"13.4% THC (high dose)" = "13.40%"),
Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)")) |>
janitor::clean_names() |>
rename(fluid = fluid_type,
thcoh = x11_oh_thc,
thccooh = thc_cooh,
thccooh_gluc = thc_cooh_gluc,
thcv = thc_v)
BR <- BR |>
mutate(Treatment = fct_recode(Treatment,
"5.9% THC (low dose)" = "5.90%",
"13.4% THC (high dose)" = "13.40%"),
Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)"),
Group = fct_recode(Group,
"Occasional user" = "Not experienced user",
"Frequent user" = "Experienced user" )) |>
janitor::clean_names() |>
rename(thc = thc_pg_pad)
compounds_WB <- as.list(colnames(Filter(function(x) !all(is.na(x)), WB[6:13])))
compounds_BR <- as.list(colnames(Filter(function(x) !all(is.na(x)), BR[6])))
compounds_OF <- as.list(colnames(Filter(function(x) !all(is.na(x)), OF[6:12])))
Then we created 3 tables based on specific minutes and labeled accordingly, covering pre-smoking and subsequent post-smoking time periods for blood, breath, and oral fluid data.
timepoints_WB <- tibble(
start = c(-400, 0, 30, 70, 100, 180, 210, 240, 270, 300),
stop = c(
0,
30,
70,
100,
180,
210,
240,
270,
300,
max(WB$time_from_start, na.rm = TRUE)
),
timepoint = c(
"pre-smoking",
"0-30 min",
"31-70 min",
"71-100 min",
"101-180 min",
"181-210 min",
"211-240 min",
"241-270 min",
"271-300 min",
"301+ min"
)
)
timepoints_BR <- tibble(
start = c(-400, 0, 40, 90, 180, 210, 240, 270),
stop = c(
0,
40,
90,
180,
210,
240,
270,
max(BR$time_from_start, na.rm = TRUE)
),
timepoint = c(
"pre-smoking",
"0-40 min",
"41-90 min",
"91-180 min",
"181-210 min",
"211-240 min",
"241-270 min",
"271+ min"
)
)
timepoints_OF <- tibble(
start = c(-400, 0, 30, 90, 180, 210, 240, 270),
stop = c(0, 30, 90, 180, 210, 240, 270,
max(OF$time_from_start, na.rm = TRUE)),
timepoint = c(
"pre-smoking",
"0-30 min",
"31-90 min",
"91-180 min",
"181-210 min",
"211-240 min",
"241-270 min",
"271+ min"
)
)
assign_timepoint <- function(x, timepoints) {
if (!is.na(x)) {
timepoints$timepoint[x > timepoints$start & x <= timepoints$stop]
} else{
NA
}
}
We created a new column, timepoint_use,
in each table by mapping the
time_from_start values to specific
timepoints defined in separate reference data frames
(timepoints_WB,
timepoints_OF,
timepoints_BR). Finally, re-leveled the
timepoint_use factor variable to align
with the order specified in the reference data frames. This ensures
consistent and meaningful timepoint labels for subsequent analyses or
visualizations in the study.
WB <- WB |>
mutate(timepoint_use = map_chr(time_from_start,
assign_timepoint,
timepoints=timepoints_WB),
timepoint_use = fct_relevel(timepoint_use, timepoints_WB$timepoint))
OF <- OF |>
mutate(timepoint_use = map_chr(time_from_start,
assign_timepoint,
timepoints=timepoints_OF),
timepoint_use = fct_relevel(timepoint_use, timepoints_OF$timepoint))
BR <- BR |>
mutate(timepoint_use = map_chr(time_from_start,
assign_timepoint,
timepoints=timepoints_BR),
timepoint_use = fct_relevel(timepoint_use, timepoints_BR$timepoint))
remove duplicate id
WB <- drop_dups(WB)
OF <- drop_dups(OF)
BR <- drop_dups(BR)
#im saving a copy of this clean OF for the extended question section.
OF_ex = drop_dups(OF)
The following plots include of all compounds against time, distinguished by color according to their respective groups. To achieve a comprehensive understanding, we generated scatterplots for compounds across three distinct matrices—namely, whole blood, oral fluid, and breath. This analysis encompasses various timepoints and considers different treatments, namely, placebo, low dose, and high dose.
Upon close examination of the scatterplots, a noteworthy observation emerges, particularly concerning the THC biomarker in whole blood. This specific biomarker appears to offer a potentially enhanced indication of recent cannabis joint usage. The scatterplot reveals a discernible separation between the placebo and THC treatment groups, suggesting that the THC measurement in whole blood may serve as a more reliable indicator of recent cannabis joint consumption.
scatter_WB <- map(compounds_WB, ~ compound_scatterplot_group_by_treatment(
dataset=WB,
compound=.x,
timepoints=timepoints_WB))
scatter_OF <- map(compounds_OF, ~ compound_scatterplot_group_by_treatment(
dataset=OF,
compound=.x,
timepoints=timepoints_OF))
scatter_BR <- map(compounds_BR, ~ compound_scatterplot_group_by_treatment(
dataset=BR,
compound=.x,
timepoints=timepoints_BR))
In the presented set of scatterplots, all compounds are graphically depicted against time, with color distinctions denoting different treatment conditions and a log transformation applied to the y-axis, which represents the respective compound measurements. A comparative analysis with the previous scatterplots reveals a modification: specifically, a log transformation has been applied to the y-axis, providing an alternative perspective on the measurement of the compounds.
Upon closer examination, a notable observation emerges. The measurement of THC from breath exhibits a more discernible separation between the placebo and THC treatment groups in the log-transformed scatterplots. This suggests that the log transformation on the y-axis enhances the visibility of distinctions between the treatment conditions for THC. The log transformation, by compressing the scale, may unveil nuances and patterns that are not as apparent on a linear scale. This nuanced insight into THC measurements underscores the importance of considering the impact of transformation techniques when analyzing compound data over time in the context of different treatments. The enhanced separation observed in the log-transformed scatterplots could potentially provide valuable insights into the effects of treatments on THC levels and underscores the sensitivity of the chosen visualization approach.
scatter_WB_by_treatment <- map(compounds_WB, ~ compound_scatterplot_group_by_treatment_log(
dataset=WB,
compound=.x,
timepoints=timepoints_WB))
scatter_OF_by_treatment <- map(compounds_OF, ~ compound_scatterplot_group_by_treatment_log(
dataset=OF,
compound=.x,
timepoints=timepoints_OF))
scatter_BR_by_treatment <- map(compounds_BR, ~ compound_scatterplot_group_by_treatment_log(
dataset=BR,
compound=.x,
timepoints=timepoints_BR))
We can see that some compounds’ measurement are clearly not useful. Here we remove them from all future analysis. The compounds removed include: WB: cbd, thccooh, thccooh_gluc, thcv OF: thcoh
compounds_WB = compounds_WB[- c(2, 5, 6, 8)]
compounds_OF = compounds_OF[- c(4)]
output_WB <- map_dfr(compounds_WB,
~ sens_spec_cpd(
dataset = WB,
cpd = all_of(.x),
timepoints = timepoints_WB
)) |> clean_gluc()
output_BR <- map_dfr(compounds_BR,
~ sens_spec_cpd(
dataset = BR,
cpd = all_of(.x),
timepoints = timepoints_BR
)) |> clean_gluc()
output_OF <- map_dfr(compounds_OF,
~ sens_spec_cpd(
dataset = OF,
cpd = all_of(.x),
timepoints = timepoints_OF
)) |> clean_gluc()
output_WB
## # A tibble: 2,470 × 17
## TP FN FP TN detection_limit compound time_start time_stop
## <dbl> <dbl> <int> <int> <dbl> <chr> <dbl> <dbl>
## 1 0 0 1 188 0.5 CBN -400 0
## 2 0 0 0 189 0.6 CBN -400 0
## 3 0 0 0 189 0.7 CBN -400 0
## 4 0 0 0 189 0.753 CBN -400 0
## 5 0 0 0 189 0.8 CBN -400 0
## 6 0 0 0 189 0.831 CBN -400 0
## 7 0 0 0 189 0.9 CBN -400 0
## 8 0 0 0 189 0.909 CBN -400 0
## 9 0 0 0 189 1 CBN -400 0
## 10 0 0 0 189 1.1 CBN -400 0
## # ℹ 2,460 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## # Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## # Efficiency <dbl>
output_BR
## # A tibble: 816 × 17
## TP FN FP TN detection_limit compound time_start time_stop
## <dbl> <dbl> <int> <int> <dbl> <chr> <dbl> <dbl>
## 1 0 0 6 185 0.5 THC -400 0
## 2 0 0 6 185 80.1 THC -400 0
## 3 0 0 6 185 85.6 THC -400 0
## 4 0 0 6 185 86.4 THC -400 0
## 5 0 0 6 185 91.8 THC -400 0
## 6 0 0 6 185 93.8 THC -400 0
## 7 0 0 6 185 98.2 THC -400 0
## 8 0 0 6 185 105. THC -400 0
## 9 0 0 6 185 105. THC -400 0
## 10 0 0 6 185 107. THC -400 0
## # ℹ 806 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## # Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## # Efficiency <dbl>
output_OF
## # A tibble: 3,928 × 17
## TP FN FP TN detection_limit compound time_start time_stop
## <dbl> <dbl> <int> <int> <dbl> <chr> <dbl> <dbl>
## 1 0 0 5 187 0.5 CBN -400 0
## 2 0 0 8 184 0.4 CBN -400 0
## 3 0 0 5 187 0.48 CBN -400 0
## 4 0 0 3 189 0.6 CBN -400 0
## 5 0 0 3 189 0.7 CBN -400 0
## 6 0 0 3 189 0.8 CBN -400 0
## 7 0 0 3 189 0.9 CBN -400 0
## 8 0 0 1 191 1 CBN -400 0
## 9 0 0 1 191 1.1 CBN -400 0
## 10 0 0 1 191 1.16 CBN -400 0
## # ℹ 3,918 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## # Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## # Efficiency <dbl>
Here we plot the value of the cutoff against sensitivity and specificity for every compound in every matrix, and arrange them all into one big plot. This is also known as the ROC curve of sensitivity and specificity against cutoff values suggests an exploration of optimal cutoff points. Overall, the specificity of all compounds increases when detection limit rises. On the other hand, sensitivity drops to zero when detection limit rises.
#arranges ss plots into one
ss_bottom_row <-
plot_grid(
ss_OF,
ss_BR,
labels = c('B', 'C'),
label_size = 12,
ncol = 2,
rel_widths = c(0.66, .33)
)
plot_grid(
ss_WB,
ss_bottom_row,
labels = c('A', ''),
label_size = 12,
ncol = 1
)
output_WB_avg = average_sens_spec(output = output_WB)
output_OF_avg = average_sens_spec(output = output_OF)
output_BR_avg = average_sens_spec(output = output_BR)
ss_WB_avg_together <-
ss_plot_avg_together(output_WB_avg, tpts = length(unique(output_WB$time_start)), tissue = "Blood")
ss_OF_avg_together <-
ss_plot_avg_together(output_OF_avg, tpts = length(unique(output_WB$time_start)), tissue = "Oral Fluid")
ss_BR_avg_together <-
ss_plot_avg_together(output_BR_avg, tpts = length(unique(output_WB$time_start)), tissue = "Breath")
It should be apparent that OF-THC is the superior choice. now we dig deeper into OF-THC and find the specific cutoff. referring back to the Average sensitivity and specificity vs. detection limit plot, we see that the detection limit is at…very close to 0 when both sensitivity and specificity are high. Let’s try out some more cutoffs close to 0.
i will now remove every compound where the average sens and spec does not intersect. reasoning: for compounds with no intersection, optimal sensitivity (left most point of the graph) = worst specificity. there is no room for adjustment because any adjustment from there on would just make everything worse.
compounds_WB = c("thc")
compounds_OF = c("thc")
compounds_BR = NULL
In this visual representation, we graph the sensitivity against specificity for each compound within every matrix, consolidating the data into a comprehensive plot. This collective visualization allows for a convenient comparison of the performance of various biomarkers concerning their specificity and sensitivity.
output_WB <- map_dfr(compounds_WB,
~ sens_spec_cpd(
dataset = WB,
cpd = all_of(.x),
timepoints = timepoints_WB
)) |> clean_gluc()
output_OF <- map_dfr(compounds_OF,
~ sens_spec_cpd(
dataset = OF,
cpd = all_of(.x),
timepoints = timepoints_OF
)) |> clean_gluc()
#plot sensitivity vs. specificity
roc_WB = roc_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = "Blood")
roc_OF = roc_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = "Oral Fluid")
# #arrange roc plots
# roc_bottom_row <-
# plot_grid(
# roc_OF,
# roc_BR,
# labels = c('B', 'C'),
# label_size = 12,
# ncol = 2,
# rel_widths = c(0.66, .33)
# )
# plot_grid(
# roc_WB,
# roc_bottom_row,
# labels = c('A', ''),
# label_size = 12,
# ncol = 1
# )
It should be apparent that OF-THC is the superior choice. now we dig deeper into OF-THC and find the specific cutoff. referring back to the Average sensitivity and specificity vs. detection limit plot, we see that the detection limit is at…very close to 0 when both sensitivity and specificity are high. Let’s try out some more cutoffs close to 0.
Taking a deeper dive into sensitivity and specificity over time over
time for the measurement of THC in Blood
and Oral Fluid tissues. In direct comparison between
the two measurement methods of THC, it becomes evident that
Oral Fluid outshines its counterpart in terms of both
sensitivity and specificity, particularly within the critical time span
of three hours post-smoking.
#pass specific cutoff into splits parameter
OF_THC <- sens_spec_cpd(
dataset = OF,
cpd = 'thc',
timepoints = timepoints_OF,
splits = c(0.5, 1, 2, 5, 10)
) |> clean_gluc()
of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
"91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
"241-270\nmin\nN=90", "271+\nmin\nN=76")
plot_cutoffs(dataset=OF_THC,
timepoint_use_variable=OF$timepoint_use,
tissue="Oral Fluid",
cpd="THC",
x_labels=NULL)
## [[1]]
##
## [[2]]
## # A tibble: 40 × 18
## TP FN FP TN detection_limit compound time_start time_stop
## <dbl> <dbl> <int> <int> <fct> <chr> <dbl> <dbl>
## 1 0 0 35 157 0.5 THC -400 0
## 2 0 0 20 172 1 THC -400 0
## 3 0 0 9 183 2 THC -400 0
## 4 0 0 0 192 5 THC -400 0
## 5 0 0 0 192 10 THC -400 0
## 6 129 0 39 24 0.5 THC 0 30
## 7 129 0 30 33 1 THC 0 30
## 8 128 1 19 44 2 THC 0 30
## 9 128 1 3 60 5 THC 0 30
## 10 125 4 1 62 10 THC 0 30
## # ℹ 30 more rows
## # ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
## # Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## # Efficiency <dbl>, my_label <fct>
the average sensitivity is a lot more sensitive (ha) to change than the average specificity - specificity only dips in the 31-90min window when the cutoff is lowered, whereas a lower cutoff increases overall sensitivity all across the board, no matter the time.
the average sensitivity is a lot more sensitive (ha) to change than the average specificity - specificity only dips in the 31-90min window when the cutoff is lowered, whereas a lower cutoff increases overall sensitivity all across the board, no matter the time. additionally, this 31-90min window where the specificity is heavily effected by a low cutoff is, in my opinion, trivial. it should be quite apparent that someone is high if they smoked within the last 90 min. a lowered specificity in this time frame isnt cause for too much concern, considering how much sensitivity is gained via using a low cutoff.
in a nutshell: a low cutoff is optimal. approxiamately somewhere between 0-2. let’s test more cutoffs in this range:
OF_THC <- sens_spec_cpd(
dataset = OF,
cpd = 'thc',
timepoints = timepoints_OF,
splits = c(0.1, 0.25, 0.5, 1, 1.5)
) |> clean_gluc()
blood_levels <- c("pre-smoking\nN=189", "0-30\nmin\nN=187", "31-70\nmin\nN=165",
"71-100\nmin\nN=157", "101-180\nmin\nN=168", "181-210\nmin\nN=103",
"211-240\nmin\nN=127", "241-270\nmin\nN=137", "271-300\nmin\nN=120",
"301+\nmin\nN=88")
of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
"91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
"241-270\nmin\nN=90", "271+\nmin\nN=76")
plot_cutoffs(dataset=OF_THC,
timepoint_use_variable=OF$timepoint_use,
tissue="Oral Fluid",
cpd="THC",
x_labels=NULL)
## [[1]]
##
## [[2]]
## # A tibble: 40 × 18
## TP FN FP TN detection_limit compound time_start time_stop
## <dbl> <dbl> <int> <int> <fct> <chr> <dbl> <dbl>
## 1 0 0 37 155 0.1 THC -400 0
## 2 0 0 37 155 0.25 THC -400 0
## 3 0 0 35 157 0.5 THC -400 0
## 4 0 0 20 172 1 THC -400 0
## 5 0 0 12 180 1.5 THC -400 0
## 6 129 0 44 19 0.1 THC 0 30
## 7 129 0 44 19 0.25 THC 0 30
## 8 129 0 39 24 0.5 THC 0 30
## 9 129 0 30 33 1 THC 0 30
## 10 128 1 23 40 1.5 THC 0 30
## # ℹ 30 more rows
## # ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
## # Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## # Efficiency <dbl>, my_label <fct>
they all look pretty promising… we need a way to quantify this. I am gonna calculate the sensitivity and specificity for cutoff values in between 0 and 2.
output_OF = sens_spec_cpd_OFTHC(
dataset = OF,
cpd = "thc",
timepoints = timepoints_OF
) |> clean_gluc()
output_OF_avg = average_sens_spec(output = output_OF)
output_OF_avg
## # A tibble: 101 × 4
## compound detection_limit average_sensitivity average_specificity
## <chr> <dbl> <dbl> <dbl>
## 1 THC 0 0.956 0
## 2 THC 0.02 0.956 0.817
## 3 THC 0.04 0.956 0.817
## 4 THC 0.06 0.956 0.817
## 5 THC 0.08 0.956 0.817
## 6 THC 0.1 0.956 0.817
## 7 THC 0.12 0.956 0.817
## 8 THC 0.14 0.956 0.817
## 9 THC 0.16 0.956 0.817
## 10 THC 0.18 0.956 0.817
## # ℹ 91 more rows
lets plot this really quick
ss_OF_avg_together <-
ss_plot_avg_together(output_OF_avg, tpts = length(unique(output_WB$time_start)), tissue = "Oral Fluid")
oh we found it. the place where they intersect is the maximum of sensitivity+specificity. lets get the specific value
output_OF_avg |>
mutate(diff = abs(average_sensitivity-average_specificity)) |>
arrange(diff)
## # A tibble: 101 × 5
## compound detection_limit average_sensitivity average_specificity diff
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 THC 0.82 0.893 0.890 0.00338
## 2 THC 0.84 0.893 0.890 0.00338
## 3 THC 0.86 0.893 0.890 0.00338
## 4 THC 0.88 0.893 0.890 0.00338
## 5 THC 0.9 0.893 0.890 0.00338
## 6 THC 0.7 0.903 0.879 0.0240
## 7 THC 0.72 0.903 0.879 0.0240
## 8 THC 0.74 0.903 0.879 0.0240
## 9 THC 0.76 0.903 0.879 0.0240
## 10 THC 0.78 0.903 0.879 0.0240
## # ℹ 91 more rows
at cutoff 0.82-0.90, the difference between the average sensitivity and average specificity is minimized. we’ll pick 0.85 for aestheicism’s sake.
During our thorough exploratory data analysis, a discernible pattern
emerged, highlighting the potency of the THC compound in
effectively indicating recent marijuana usage. Consequently, our focus
in the data analysis section specifically hones in on the sensitivity
and specificity cutoff measurements within the Blood
and Oral Fluid tissues. Notably, Oral
Fluid consistently exhibits superior sensitivity across all
timepoints. By scrutinizing the Receiver Operating Characteristic (ROC)
curve comparing the THC measurements in Blood and Oral
Fluid, a clear trend emerges – Oral Fluid surpasses Blood in accuracy.
In simpler terms, Oral Fluid proves more adept at
detecting recent marijuana joint usage compared to its
Blood counterpart. This confirmation underscores the
THC measurement in Oral Fluid as the paramount biomarker
for recent use.
Moving forward, our focus shifts to determining the optimal cutoff
values for both sensitivity and specificity. To achieve this, we
calculate the average sensitivity and specificity across all time
windows for various detection limits, identifying the point at which
these metrics intersect – a key indicator of the optimal cutoff. Upon
plotting the graph and meticulously examining the associated table, a
convergence becomes evident at detection limits ranging from
0.82 to 0.90. Hence, it becomes apparent that
the 0.82 to 0.90 detection limit of the
THC compound in Oral Fluid stands out as the most effective
biomarker for recent use.
Moreover, our exploration extends to a broader question. The
visualizations shed light on the distinct responses of various compounds
to marijuana dosage. Notably, we focus on key comparisons:
CBG with THC in Blood, CBN with
CBG in Oral Fluid, and CBG with
THCV in Oral Fluid. The analysis of CBG with
THC in Blood reveals a noteworthy observation – the
coefficient of THC in the low dose group significantly
exceeds that in the high dose group. This implies that, for the low dose
group, each increment in THC correlates more strongly with
CBG compared to the high dose group. Shifting attention to
the comparison between CBN with CBG in Oral
Fluid, the coefficient of CBG in the high dose group is
notably higher. This suggests that as the dosage increases, a heightened
correlation emerges between CBG and CBN. These
nuanced insights deepen our understanding of compound interactions in
response to varying marijuana doses.
The results of our case study on biomarkers of recent marijuana use reveal intriguing insights into the choice of compound, matrix, and cutoff for effective detection. The overarching goal was to identify the most potent biomarker that, when combined with a specific matrix and cutoff, can reliably indicate recent marijuana use within a three-hour timeframe.
Our analysis consistently points to THC in oral fluid as
the optimal biomarker for recent marijuana use. The scatterplots
depicting compound measurements over time clearly show a distinct
separation between the placebo and THC treatment groups,
particularly in oral fluid. The ROC curve analysis further supports this
conclusion, with oral fluid THC demonstrating superior
sensitivity across all time points compared to other matrices. The
enhanced sensitivity in oral fluid makes it a compelling choice for
detecting recent marijuana use accurately.
To determine the most effective cutoff for sensitivity and
specificity, we conducted a thorough analysis, revealing that a cutoff
value between 0.82 and 0.90 maximizes the
balance between sensitivity and specificity. We opted for
0.85 as the optimal cutoff, considering the intersection
point where the difference between average sensitivity and specificity
is minimized. This ensures a practical and balanced approach for
identifying recent marijuana use.
there are some other possible ways of going about this… for example, someone who is quite recently high should be obviously high to an officer, so it is viable to give the samples in the 0-31min time window less weight when considering overall sensitivity and specificity. my reasoning for not doing is because any sort of “weight” would be quite arbitrary and meaningless.
the optimal biomarker is OF,THC, at cutoff = 0.85.
WB_long = WB |>
pivot_longer(6:13, names_to = "compound")
OF_long = OF |>
pivot_longer(6:12, names_to = "compound")
BR_long <- BR |> pivot_longer(6)
df_full <- bind_rows(WB_long, OF_long, BR_long)
OF = OF_ex
compounds_OF <- as.list(colnames(Filter(function(x) !all(is.na(x)), OF[6:12])))
pairs(OF[,unlist(compounds_OF)],
pch=19,
cex=0.4,
cex.labels=0.6,
labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(OF[,unlist(compounds_OF)])))))
This is a pair plot showing the correlation between different compounds in th OF matrix. Notice how some of them look like two separate lines. This should be a clear sign that there’s some underlying variable that effects both of the compounds. After some trial and error, we found that this observation is due to the treatment variable - specifically, low dose and high dose group seems to have different slopes when it comes to the correlation between certain compounds.
Here’s a few of the more obvious ones:
OF = OF_ex |>
filter(treatment != "Placebo")
ggplot(data = OF,
mapping = aes(y = thc, x = cbn, color = treatment)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x)
ggplot(data = OF,
mapping = aes(y = thc, x = cbg, color = treatment)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x)
Also, we are removing the placebo group since it is not of interest.
This seems to show that low dose/high dose they change the correlation between chemicals. Interesting. For this report, we will narrow our target of investigation to THC vs CBG in OF.
If there is, in fact, some significant difference between compound correlation between different treatment, we should be able to construct some sort of model that, given a sample of the measurement of those two compounds, would be able to predict which treatment group this sample is pulled from to some meaningful degree of accuracy.
Let’s do some analysis.
First we try fitting a single variable linear model with THC as the value, and CBG as the explanatory variable.
lm_cbg = linear_reg() |>
set_engine("lm") |>
fit(thc ~ cbg, data = OF)
lm_cbg
## parsnip model object
##
## Fit time: 4ms
##
## Call:
## stats::lm(formula = thc ~ cbg, data = data)
##
## Coefficients:
## (Intercept) cbg
## 60.30 14.35
glance(lm_cbg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.850 0.849 446. 3581. 5.65e-263 1 -4782. 9570. 9583.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Here the resulting linear model is THC = 14.41CBG + 39.8 with an R^2 of 0.85.
Now we test if the addition of treatment significantly improves the model by doing multiple linear regression:
#multiple linear regression model with only main effect
mlm_cbg1 = linear_reg() |>
set_engine("lm") |>
fit(thc ~ cbg + treatment, data = OF)
#multiple linear regression model with main effect + interaction effect
mlm_cbg2 = linear_reg() |>
set_engine("lm") |>
fit(thc ~ cbg * treatment, data = OF)
mlm_cbg1
## parsnip model object
##
## Fit time: 4ms
##
## Call:
## stats::lm(formula = thc ~ cbg + treatment, data = data)
##
## Coefficients:
## (Intercept) cbg
## -16.64 14.45
## treatment13.4% THC (high dose)
## 156.34
glance(mlm_cbg1)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.854 0.854 440. 1853. 2.41e-265 2 -4772. 9552. 9570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
mlm_cbg2
## parsnip model object
##
## Fit time: 2ms
##
## Call:
## stats::lm(formula = thc ~ cbg * treatment, data = data)
##
## Coefficients:
## (Intercept) cbg
## 48.4663 11.6821
## treatment13.4% THC (high dose) cbg:treatment13.4% THC (high dose)
## 0.5256 12.9414
glance(mlm_cbg2)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.969 0.969 202. 6652. 0 3 -4277. 8563. 8585.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Adding in treatment as another explanatory variable and including interaction effects produces a noticeably more accurate model - with a R^2 of ~0.97.
Just to be thorough, let’s try another way. Here we fit a linear model for high does and low dose group separately.
OF_low <- OF |>
filter(treatment == "5.9% THC (low dose)")
OF_high <- OF |>
filter(treatment == "13.4% THC (high dose)")
#linear regression `cbn ~ cbg` for 5.9% THC (low dose) group
lr_cbg_low = linear_reg() |>
set_engine("lm") |>
fit(cbn ~ cbg, data = OF_low)
glance(lr_cbg_low)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.945 0.945 34.7 5641. 1.02e-208 1 -1638. 3282. 3293.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
#linear regression `cbn ~ cbg` for 13.4% THC (high dose) group
lr_cbg_high = linear_reg() |>
set_engine("lm") |>
fit(cbn ~ cbg, data = OF_high)
glance(lr_cbg_high)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.913 0.913 67.8 3191. 3.01e-163 1 -1724. 3453. 3464.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
We attempted four different ways of modeling the relationship between CBG and THC: directly fitting a linear model with CBG as the only explanatory variable, fitting a multiple linear model with both CBG and treatment as explanatory variables but only with main effect, multiple linear model with both CBG and treatment as explanatory variables with main effect and interaction effect, and finally fitting a linear model for the low dose and high dose group separately. Out of all these models, the multiple linear model with two explanatory variables and interaction effect proves to be the best model, explaining almost 97% of all observed data. This is quite impressive, especially considering that even the linear models that are fitted onto high dose and low dose group separately did not achieve this level of accuracy.
This result has both practical implications and potential to serve as grounf for further research. On the one hand, being able to accurate predict the value of another measurement based on one measurement and information about the dosage of THC can significantly lower the cost, time, and energy required for data collection. On the other hand, this predictive power of dosage in terms of how certain compounds relate to each other seem to suggest a non-linear relationship between dosage and certain compound’s value. Further research is required to investigate this proposition.
Research Note: Results of the 2013-2014 National Roadside Survey of Alcohol and Drug Use by Drivers. NHTSA. https://www.nhtsa.gov/sites/nhtsa.gov/files/812118-roadside_survey_2014.pdf↩︎
Drug-Impaired Driving. NHTSA. https://www.nhtsa.gov/risky-driving/drug-impaired-driving#:~:text=In%202007%2C%20NHTSA’s%20National%20Roadside,in%20less%20than%2010%20years.↩︎